Data Overview

In this notebook:

  • Training data is loaded
  • Data is explored with pandas profiling
  • Samples without text are deleted
  • Text sections surrounding the gene variation name are extracted
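The step of deleting samples without text can be sketched as follows. This is a minimal illustration on invented toy data, not the notebook's actual implementation. The helper name `drop_samples_without_text` is hypothetical, and treating whitespace-only text as missing is an assumption.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only (not taken from the real dataset).
toy_df = pd.DataFrame(
    {
        "ID": [0, 1, 2, 3],
        "Gene": ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
        "Text": ["some article text", np.nan, "   ", "more article text"],
    }
)


def drop_samples_without_text(df, text_column="Text"):
    """Hypothetical helper: return df without rows whose text is missing or blank."""
    text = df[text_column]
    # A row is kept only if the text is not NaN and not whitespace-only.
    has_text = text.notna() & (text.astype(str).str.strip() != "")
    return df[has_text].reset_index(drop=True)


cleaned_df = drop_samples_without_text(toy_df)
print(len(toy_df) - len(cleaned_df), "samples without text removed")
```

Resetting the index keeps row positions contiguous after the removal, which is convenient if later steps index by position.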

Key insights:

  • 3321 training samples
  • 5 samples without text (3316 remain after removal)
  • Unbalanced dataset
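One way to check how unbalanced the classes are is to count the samples per class. The sketch below uses invented toy labels, not the real class counts. The helper `class_distribution` is hypothetical, and the column name `Class` is taken from the data frame shown in this notebook.

```python
import pandas as pd

# Toy labels, invented for illustration only (not the real class counts).
toy_labels = pd.Series([1, 1, 1, 1, 2, 2, 3], name="Class")


def class_distribution(labels):
    """Hypothetical helper: per-class sample count and share of the total."""
    counts = labels.value_counts().sort_index()
    shares = counts / counts.sum()
    return pd.DataFrame({"count": counts, "share": shares})


distribution = class_distribution(toy_labels)
# Ratio between the largest and the smallest class; 1.0 would mean perfectly balanced.
imbalance_ratio = distribution["count"].max() / distribution["count"].min()
print(distribution)
print("Imbalance ratio (largest / smallest class):", imbalance_ratio)
```

On the real data, the same call could be applied to the `Class` column of the training frame.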

Imports

In [1]:
import sys
import pandas as pd
import os
from pandas_profiling import ProfileReport
import numpy as np

sys.path.append("../utils/")

from preprocessing import extract_text_sections
from preprocessing import get_data

Load Raw Data

Load, explore, and prepare all required data.

In [2]:
data_path = "../../data/msk-redefining-cancer-treatment"
In [3]:
# Training Data - Text and Genetic Variants Information
training_merge_df = get_data(
    text_file_path="raw/training_text", variants_file_path="raw/training_variants"
)
training_size = training_merge_df.shape[0]
print("Number of Training Samples", training_size)
training_merge_df.head()

# Validation Data - Text and Genetic Variants Information
validation_merge_df = get_data(
    text_file_path="raw/test_text",
    variants_file_path="raw/test_variants",
    solution_file_path="raw/stage1_solution_filtered.csv",
)
validation_size = validation_merge_df.shape[0]
print("Number of Validation Samples:", validation_size)

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
raw_data_df = pd.concat([training_merge_df, validation_merge_df], sort=False)
Number of Training Samples 3316
Number of Validation Samples: 367

Class Definitions:

  • 1: Likely Loss-of-function
  • 2: Likely Gain-of-function
  • 3: Neutral
  • 4: Loss-of-function
  • 5: ...

Classification Example:

In [4]:
raw_data_df[raw_data_df["Variation"] == "V391I"]
Out[4]:
ID Gene Variation Class Text
5 5 CBL V391I 4 Oncogenic mutations in the monomeric Casitas B...
In [5]:
raw_data_df[raw_data_df["Variation"] == "V391I"]["Text"].tolist()[0][
    31228 - 35 : 31228 + 43
]
Out[5]:
'mutations (L399V, G375P, P395A and V391I) which attenuated the CBL E3 activity'

In the text for the CBL V391I variation, we find the passage 'mutations (L399V, G375P, P395A and V391I) which attenuated the CBL E3 activity'. This matches label 4, which indicates a loss of function.
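The manual slice above relies on a hard-coded character offset. A more general way to cut a window around a variation name is to locate the name first and then slice relative to it. The sketch below is an illustration only and is not the `extract_text_sections` function imported from the utils module. The helper `extract_window` and its parameters are hypothetical, the window sizes reuse the 35 and 43 characters from the slice above, and the sample text is the quoted passage padded with invented sentences.

```python
def extract_window(text, variation, chars_before=35, chars_after=43):
    """Hypothetical helper: text around the first occurrence of `variation`, or None."""
    position = text.find(variation)
    if position == -1:
        # The variation name does not occur in the text.
        return None
    # Clamp the start so a match near the beginning does not wrap around.
    start = max(position - chars_before, 0)
    end = position + chars_after
    return text[start:end]


# Quoted passage from the notebook, padded with invented sentences.
sample_text = (
    "Intro sentence. mutations (L399V, G375P, P395A and V391I) "
    "which attenuated the CBL E3 activity. Closing sentence."
)
window = extract_window(sample_text, "V391I")
print(window)
```

Only the first occurrence is used here. A text that mentions the variation several times would need `re.finditer` or repeated `str.find` calls to collect every section.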

Explore Raw Data

In [9]:
ProfileReport(raw_data_df).to_notebook_iframe()

Load Data with Additional Features

In [10]:
train_processed = pd.read_csv(
    os.path.join(data_path, "interim/training_data_additional_features")
)
In [11]:
ProfileReport(train_processed).to_notebook_iframe()